wil-meta2.txt
Jun 20, 2017. This supersedes wil-meta.txt as a description of
 the current digitization, wil.txt.
-  'meta' lines added to wil.txt.

wil.txt is encoded in UTF-8. The non-ASCII characters that occur
are listed at the end of this file.

The {X...X} style of coding serves several purposes:
{#X#}  200898 : Devanagari text. Originally coded in HK;
                as of Nov 14, 2014, X is in SLP1 in wil.txt
{%X%}   10790 : italic text
{@X@}   10790 : bold text
{??}        1 : unreadable text
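
A minimal Python sketch of how the brace-style codes above could be
counted. The names MARKUP and count_markup are illustrative, not part of
any existing tool, and the patterns assume that spans do not nest and do
not cross line boundaries.

```python
import re
from collections import Counter

# Patterns for the brace-style codes listed above.
# Assumption: spans never nest and never run across a line break.
MARKUP = {
    "devanagari": re.compile(r"\{#(.*?)#\}"),
    "italic": re.compile(r"\{%(.*?)%\}"),
    "bold": re.compile(r"\{@(.*?)@\}"),
    "unreadable": re.compile(r"\{\?\?\}"),
}

def count_markup(lines):
    """Return a Counter of how many spans of each kind occur in lines."""
    counts = Counter()
    for line in lines:
        for kind, pattern in MARKUP.items():
            counts[kind] += len(pattern.findall(line))
    return counts

# Two lines taken from the sample entry in this file.
sample = ["{#aMSa#}¦ m. ({#-SaH#})", ".E. {#aMSa#} to divide, {#ac#} affix."]
print(count_markup(sample)["devanagari"])  # prints 4
```

Run over the whole of wil.txt (opened with encoding="utf-8"), the totals
should come out close to the figures in the table above.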

The following <>-type tags are found in wil.txt:
<H>  47 : letter breaks
<symbol>X</symbol> 1 : Under 'svastika'
<lang n="arabic">X</lang> 5
<lang n="greek">X</lang>  20
<div n="lex"/>  1 (so far).  A line break before a gender specification.

Meta lines: 
Each entry of the digitization is enclosed between a beginning and an
ending markup line. For example,
<L>6<pc>001<k1>aMSa<k2>aMSa
{#aMSa#}¦ m. ({#-SaH#})
.²1 A share or portion.
.²2 A part.
.²3 A shoulder, the shoulder blade.
.²4 (In arithmetic) a fraction.
.²5 The numerator of a fraction.
.²6 A degree of latitude or longitude, &c. See {#aMsa.#}
.E. {#aMSa#} to divide, {#ac#} affix.

<LEND>

The ending markup is <LEND>.
The beginning markup contains various identifying fields, expressed in
a <fieldname>fieldvalue format. The fields are:
required fields
  L Cologne record identifier
  pc page-col reference to scanned image
  k1 key1. The headword spelling, in slp1 coding for Sanskrit headwords
  k2 key2. The original headword spelling, either in slp1 or IAST

optional field for homonym
  hom The homonym number (usually a digit). Not present in wil dictionary
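
The entry structure described above can be read with a short Python
sketch such as the following. The names FIELD, REQUIRED, parse_meta and
iter_entries are hypothetical. The sketch assumes that each beginning
markup occupies a single line starting with <L>, and that field values
never contain the character '<'.

```python
import re

# One <fieldname>fieldvalue pair; the value runs up to the next '<'.
FIELD = re.compile(r"<([A-Za-z0-9]+)>([^<]*)")
REQUIRED = ("L", "pc", "k1", "k2")

def parse_meta(line):
    """Parse a beginning-markup line into a dict of field name -> value."""
    fields = dict(FIELD.findall(line.strip()))
    missing = [name for name in REQUIRED if name not in fields]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return fields

def iter_entries(lines):
    """Yield (fields, body_lines) for each <L> ... <LEND> block."""
    fields, body = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("<L>"):
            fields, body = parse_meta(line), []
        elif line.startswith("<LEND>"):
            if fields is not None:
                yield fields, body
            fields = None
        elif fields is not None:
            body.append(line)

meta = parse_meta("<L>6<pc>001<k1>aMSa<k2>aMSa")
print(meta)  # {'L': '6', 'pc': '001', 'k1': 'aMSa', 'k2': 'aMSa'}

sample = [
    "<L>6<pc>001<k1>aMSa<k2>aMSa",
    "{#aMSa#}¦ m. ({#-SaH#})",
    ".²1 A share or portion.",
    "",
    "<LEND>",
]
for fields, body in iter_entries(sample):
    print(fields["L"], fields["k1"], len(body))  # 6 aMSa 3
```

An optional field such as hom, when present, simply appears as an extra
key in the returned dict. Lines outside any <L> ... <LEND> block are
skipped.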

---------------------------------

Page breaks are coded as [PageX], where X is the page number,
from 1 to 982.  Each page has two columns, but the column breaks are
not specified in the digitization.
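
As a rough illustration, page-break codes could be collected as in the
Python sketch below. It takes the [PageX] form literally and assumes
that X is written with digits only, possibly with leading zeros; the
exact form used in wil.txt should be checked before relying on it. The
names PAGE_BREAK and page_numbers are hypothetical.

```python
import re

# Assumption: the page number is digits only, e.g. [Page1] or [Page001].
PAGE_BREAK = re.compile(r"\[Page(\d+)\]")

def page_numbers(lines):
    """Return the page numbers of all page-break codes, in file order."""
    pages = []
    for line in lines:
        for match in PAGE_BREAK.finditer(line):
            pages.append(int(match.group(1)))
    return pages

# Invented sample lines, for illustration only.
print(page_numbers(["some text [Page001]", "more text", "[Page2]"]))  # [1, 2]
```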

The lines of the digitization generally represent 'sections' of the text; the
actual line-breaks of the text are not coded. As indicated in the example above,
sections within an entry are generally coded with the 2-character sequence
'.²' at the beginning of lines.
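
A small Python sketch of separating the '.²' sections from the other
lines of an entry. SECTION_MARK and split_sections are illustrative
names. Lines that do not begin with the marker, such as the headword
line and the '.E.' line in the example above, are returned separately.

```python
SECTION_MARK = ".\u00b2"  # '.' followed by SUPERSCRIPT TWO

def split_sections(body_lines):
    """Split entry lines into (other, sections).

    sections holds the lines that begin with '.²', marker removed;
    other holds every remaining line, in its original order.
    """
    other, sections = [], []
    for line in body_lines:
        if line.startswith(SECTION_MARK):
            sections.append(line[len(SECTION_MARK):])
        else:
            other.append(line)
    return other, sections

body = [
    "{#aMSa#}¦ m. ({#-SaH#})",
    ".²1 A share or portion.",
    ".²2 A part.",
    ".E. {#aMSa#} to divide, {#ac#} affix.",
]
other, sections = split_sections(body)
print(sections)  # ['1 A share or portion.', '2 A part.']
```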

The headwords are ordered according to Sanskrit alphabet ordering.


Non-Devanagari Sanskrit in the body of the entries has been converted
to IAST Unicode, generally as described in https://en.wikipedia.org/wiki/International_Alphabet_of_Sanskrit_Transliteration.
In this process, the spellings of many words have been modernized, and
gratuitous variations of spelling have been normalized.

Here are the non-ASCII characters that occur in wil.txt,
with their approximate frequencies.

¦  (\u00a6) 44578 := BROKEN BAR
°  (\u00b0)     5 := DEGREE SIGN
²  (\u00b2) 44744 := SUPERSCRIPT TWO
¼  (\u00bc)     1 := VULGAR FRACTION ONE QUARTER
½  (\u00bd)     3 := VULGAR FRACTION ONE HALF
Æ  (\u00c6)    44 := LATIN CAPITAL LETTER AE
Ñ  (\u00d1)    31 := LATIN CAPITAL LETTER N WITH TILDE
ä  (\u00e4)     7 := LATIN SMALL LETTER A WITH DIAERESIS
æ  (\u00e6)   163 := LATIN SMALL LETTER AE
ë  (\u00eb)    17 := LATIN SMALL LETTER E WITH DIAERESIS
ï  (\u00ef)    38 := LATIN SMALL LETTER I WITH DIAERESIS
ñ  (\u00f1)    42 := LATIN SMALL LETTER N WITH TILDE
ö  (\u00f6)    27 := LATIN SMALL LETTER O WITH DIAERESIS
ü  (\u00fc)     9 := LATIN SMALL LETTER U WITH DIAERESIS
Ā  (\u0100)  1927 := LATIN CAPITAL LETTER A WITH MACRON
ā  (\u0101)  4151 := LATIN SMALL LETTER A WITH MACRON
Ī  (\u012a)   452 := LATIN CAPITAL LETTER I WITH MACRON
ī  (\u012b)   520 := LATIN SMALL LETTER I WITH MACRON
œ  (\u0153)    53 := LATIN SMALL LIGATURE OE
Ś  (\u015a)  1387 := LATIN CAPITAL LETTER S WITH ACUTE
ś  (\u015b)   313 := LATIN SMALL LETTER S WITH ACUTE
Ū  (\u016a)    55 := LATIN CAPITAL LETTER U WITH MACRON
ū  (\u016b)   243 := LATIN SMALL LETTER U WITH MACRON
α  (\u03b1)     6 := GREEK SMALL LETTER ALPHA
β  (\u03b2)     6 := GREEK SMALL LETTER BETA
γ  (\u03b3)     5 := GREEK SMALL LETTER GAMMA
δ  (\u03b4)     7 := GREEK SMALL LETTER DELTA
ε  (\u03b5)     1 := GREEK SMALL LETTER EPSILON
ζ  (\u03b6)     2 := GREEK SMALL LETTER ZETA
η  (\u03b7)     1 := GREEK SMALL LETTER ETA
κ  (\u03ba)     1 := GREEK SMALL LETTER KAPPA
λ  (\u03bb)     3 := GREEK SMALL LETTER LAMDA
ν  (\u03bd)     1 := GREEK SMALL LETTER NU
ا  (\u0627)     1 := ARABIC LETTER ALEF
ب  (\u0628)     1 := ARABIC LETTER BEH
ت  (\u062a)     1 := ARABIC LETTER TEH
ج  (\u062c)     1 := ARABIC LETTER JEEM
ح  (\u062d)     1 := ARABIC LETTER HAH
ر  (\u0631)     3 := ARABIC LETTER REH
ز  (\u0632)     2 := ARABIC LETTER ZAIN
ف  (\u0641)     1 := ARABIC LETTER FEH
ل  (\u0644)     1 := ARABIC LETTER LAM
م  (\u0645)     1 := ARABIC LETTER MEEM
ن  (\u0646)     1 := ARABIC LETTER NOON
ه  (\u0647)     1 := ARABIC LETTER HEH
و  (\u0648)     1 := ARABIC LETTER WAW
ي  (\u064a)     2 := ARABIC LETTER YEH
Ḍ  (\u1e0c)   131 := LATIN CAPITAL LETTER D WITH DOT BELOW
ḍ  (\u1e0d)   236 := LATIN SMALL LETTER D WITH DOT BELOW
ḥ  (\u1e25)     2 := LATIN SMALL LETTER H WITH DOT BELOW
Ḷ  (\u1e36)     1 := LATIN CAPITAL LETTER L WITH DOT BELOW
ḷ  (\u1e37)     1 := LATIN SMALL LETTER L WITH DOT BELOW
Ḹ  (\u1e38)     1 := LATIN CAPITAL LETTER L WITH DOT BELOW AND MACRON
Ṃ  (\u1e42)    36 := LATIN CAPITAL LETTER M WITH DOT BELOW
ṃ  (\u1e43)    52 := LATIN SMALL LETTER M WITH DOT BELOW
Ṅ  (\u1e44)    43 := LATIN CAPITAL LETTER N WITH DOT ABOVE
ṅ  (\u1e45)   120 := LATIN SMALL LETTER N WITH DOT ABOVE
Ṇ  (\u1e46)  1112 := LATIN CAPITAL LETTER N WITH DOT BELOW
ṇ  (\u1e47)  2216 := LATIN SMALL LETTER N WITH DOT BELOW
Ṛ  (\u1e5a)   443 := LATIN CAPITAL LETTER R WITH DOT BELOW
ṛ  (\u1e5b)   146 := LATIN SMALL LETTER R WITH DOT BELOW
Ṣ  (\u1e62)   992 := LATIN CAPITAL LETTER S WITH DOT BELOW
ṣ  (\u1e63)   563 := LATIN SMALL LETTER S WITH DOT BELOW
Ṭ  (\u1e6c)    76 := LATIN CAPITAL LETTER T WITH DOT BELOW
ṭ  (\u1e6d)   156 := LATIN SMALL LETTER T WITH DOT BELOW
‘  (\u2018)     8 := LEFT SINGLE QUOTATION MARK
’  (\u2019)     7 := RIGHT SINGLE QUOTATION MARK
“  (\u201c)     2 := LEFT DOUBLE QUOTATION MARK
”  (\u201d)     2 := RIGHT DOUBLE QUOTATION MARK
卐  (\u5350)     1 := CJK UNIFIED IDEOGRAPH-5350
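
A table in the format above could be regenerated with a Python sketch
like the following. This is not necessarily how the list was originally
produced. The function name non_ascii_table is hypothetical, and
character names come from Python's unicodedata module, so they depend on
the Unicode version shipped with the interpreter.

```python
import unicodedata
from collections import Counter

def non_ascii_table(text):
    """Return one formatted row per distinct non-ASCII character in text."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    rows = []
    for ch in sorted(counts):  # sorted by code point
        name = unicodedata.name(ch, "UNKNOWN")
        rows.append(f"{ch}  (\\u{ord(ch):04x}) {counts[ch]:5d} := {name}")
    return rows

# Short sample built from lines of the example entry in this file.
for row in non_ascii_table("{#aMSa#}¦ m.\n.²1 A share.\n.²2 A part."):
    print(row)
```

For the full file, read wil.txt with encoding="utf-8" and pass the whole
text to the function.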
